ABSTRACT: Many particle physics analyses which need to discriminate some background
process from a signal ignore event-by-event resolutions of kinematic variables. Adding this
information, as is done for missing momentum significance, can only improve the power of
existing techniques. We therefore propose the use of significance variables which combine
kinematic information with event-by-event resolutions. We begin by giving some explicit
examples of constructing optimal significance variables. Then, we consider three applications:
new heavy gauge bosons, Higgs to $\tau\tau$, and direct stop squark pair production. We
find that significance variables can provide additional discriminating power over the original
kinematic variables: ~20% improvement over $m_T$ in the $H \to \tau\tau$ case, and ~30%
improvement over $m_{T2}$ in the case of the direct stop search.
1 Introduction

There is a set of key observables which seem, hitherto, to have received scant to non-existent
attention in the literature. These observables are the event-by-event resolutions
of individual kinematic variables which constitute the building blocks of most analyses at
present. Such analyses (which we will call "cut-based") will, for the foreseeable future,
continue to be found in a large fraction of collider physics search papers, even though
more powerful techniques are available.¹ One of the main reasons that cut-and-count usage
remains strong, despite non-optimality, is the perceived simplicity with which "reasonable"
analyses can be developed. Against this backdrop we should ask: "How can event-by-event
resolutions be used effectively within current analyses without fundamentally changing the
way they are done?"

¹ In the appropriate context, any technique which makes sensible and full use of the joint likelihood of
the data as a function of all relevant parameters cannot be beaten.
2 A concrete example



Consider a kinematic variable $m$ which, in the absence of new physics and detector resolutions,
has a classical maximum $M$. For example, $m$ could be transverse momentum or the
actual mass of some system of particles. The usual procedure for using $m$ is to place a cut
value $m_{\text{cut}}$ and then to count the number of events for which $m > m_{\text{cut}}$. If this number
significantly exceeds expectation, then one has evidence for new physics. However, one can
do better than this by including more information such as event-by-event resolutions (and
the mass scale $M$). For example, consider the probability $P_M$ that a remeasurement of the
observed value $m^{\text{observed}}$ for a fixed event exceeds the scale $M$. Symbolically, this is



$$P_M = \Pr\left(m^{\text{(re)measured}} > M \mid R_m\right), \tag{2.1}$$

where $R_m$ is the resolution function² $p(m^{\text{(re)measured}} \mid m^{\text{observed}})$. For general purposes, one
assumes that $R_m$ is a Gaussian function centered at the measured value with a width given
by $\sigma_m$. In this case, we can explicitly compute $P_M$, as in Eq. 2.2.

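$$\begin{aligned}
P_M &= \int_M^\infty p\left(m^{\text{(re)measured}} \mid R_m\right)\, dm^{\text{(re)measured}} \\
&= \int_M^\infty \frac{1}{\sqrt{2\pi}\,\sigma_m} \exp\left[-\frac{\left(m^{\text{(re)measured}} - m^{\text{observed}}\right)^2}{2\sigma_m^2}\right] dm^{\text{(re)measured}} \\
&= \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{m^{\text{observed}} - M}{\sqrt{2}\,\sigma_m}\right)\right].
\end{aligned} \tag{2.2}$$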




Since the erf function is monotonic and smooth, the complete behavior of $P_M$ is determined
by the quantity

$$X_M = \frac{m^{\text{observed}} - M}{\sigma_m}. \tag{2.3}$$
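As a minimal sketch of Eqs. 2.2 and 2.3, assuming a Gaussian resolution, the following Python functions evaluate $X_M$ and $P_M$ for a single event. The function names and the example numbers are illustrative assumptions and are not taken from the analyses discussed here.

```python
import math


def x_significance(m_observed, scale_m, sigma_m):
    """X_M = (m_observed - M) / sigma_m, as in Eq. 2.3."""
    return (m_observed - scale_m) / sigma_m


def p_exceeds(m_observed, scale_m, sigma_m):
    """P_M for a Gaussian resolution of width sigma_m (erf form of Eq. 2.2)."""
    x = x_significance(m_observed, scale_m, sigma_m)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Two hypothetical events with the same observed m but different resolutions.
M = 80.0
for m_obs, sigma in [(95.0, 5.0), (95.0, 20.0)]:
    print(m_obs, sigma, x_significance(m_obs, M, sigma), p_exceeds(m_obs, M, sigma))
```

The two events are indistinguishable in $m$ alone, but the well-measured one has a larger $X_M$ (and hence a larger $P_M$), which is the extra information a cut on $X_M$ can exploit.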

Perhaps surprisingly, very few analyses seem to use quantities like $X_M$. In fact, so
far as the authors are aware, the only variable of this type that has seen significant usage
in the collider literature is the "$E_T^{\text{miss}}$ significance", not to be confused with $E_T^{\text{miss}}$. The
latter is the magnitude of the transverse momentum necessary for conservation in the plane
perpendicular to the beam, whereas $E_T^{\text{miss}}$ significance, first constructed at D0 [1], in its
most complete form usually refers to the log of a likelihood ratio

$$\log\left[\frac{p\left(E_T^{\text{miss}} = E_T^{\text{miss, measured}}\right)}{p\left(E_T^{\text{miss}} = 0\right)}\right]$$

where $p(E_T^{\text{miss}} = x)$ is the probability density for remeasured values of the missing transverse
energy. The purpose of $E_T^{\text{miss}}$ significance is to differentiate events with real missing energy
from invisible particles like neutrinos from those without, and it is constructed from the
resolution functions of all the objects used to construct the $E_T^{\text{miss}}$ itself. For a Gaussian
resolution of width $\sigma$, this log-likelihood ratio is proportional to

$$\left(\frac{E_T^{\text{miss, measured}}}{\sigma}\right)^2.$$

In general, it can be tedious to precisely determine $\sigma$ on an event-by-event basis. Therefore,
one observes [2, 3] that $\sigma_{E_T^{\text{miss}}} \propto \sqrt{H_T}$, where $H_T$ is the scalar sum of the visible $p_T$ in the event. Then,
an approximate $E_T^{\text{miss}}$ significance may be written as a monotonic function of $(E_T^{\text{miss}})^2 / H_T$,
and in fact, the most commonly used choice is $E_T^{\text{miss}}/\sqrt{H_T}$.

² $p$ will be the generic symbol for a probability density function.

We note that the approximate $E_T^{\text{miss}}$ significance defined above is a realisation of $X_M$
in which (i) $M = 0$, (ii) we assume a Gaussian resolution function centered at the measured
$E_T^{\text{miss}}$, and (iii) $\sigma \propto \sqrt{H_T}$.
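The special case just described can be written in a couple of lines. This is only a sketch: the function name is an assumption, and the unknown proportionality constant between $\sigma$ and $\sqrt{H_T}$ is dropped, so the result is $X_M$ with $M = 0$ only up to an overall factor.

```python
import math


def approx_met_significance(met, h_t):
    """Approximate E_T^miss significance, E_T^miss / sqrt(H_T).

    Equals X_M with M = 0 and sigma proportional to sqrt(H_T), up to the
    (dropped) proportionality constant. Inputs are assumed to be in GeV.
    """
    if h_t <= 0.0:
        raise ValueError("H_T must be positive")
    return met / math.sqrt(h_t)


# Same missing energy, different hadronic activity (illustrative numbers).
quiet_event = approx_met_significance(met=60.0, h_t=100.0)
busy_event = approx_met_significance(met=60.0, h_t=900.0)
print(quiet_event, busy_event)
```

An event with little visible activity is assigned a larger significance than a busy event with the same $E_T^{\text{miss}}$, reflecting its better expected resolution.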

Even though $E_T^{\text{miss}}/\sqrt{H_T}$ and $E_T^{\text{miss}}$ are correlated, one can gain statistical power
by considering $E_T^{\text{miss}}/\sqrt{H_T}$ in addition to or instead of $E_T^{\text{miss}}$ itself. This has been shown
in analyses spanning a wide range of physics processes including Standard Model measurements
[6-8, 11, 12] and searches for the Higgs boson [10], Dark Matter [9], and supersymmetric
particles [4, 5].

Motivated by the gains found by using the missing energy significance $E_T^{\text{miss}}/\sqrt{H_T}$
in addition to $E_T^{\text{miss}}$, we want to see whether similar profits are to be had from building
significance-related quantities for other kinematic variables.



3 Significance variables 

There are many ways that cut-based analyses could be modified to make good use of
event-by-event resolutions. The least prescriptive (and in some cases least effective) method
simply adds to each event the resolutions as additional variables in their own right upon
which to make cuts. Indeed, simply doing this and leaving a Multivariate Analysis (MVA)
tool to find the best way of using the additional information will appeal to many.³

However, readers will have noted that the physics of the preceding example of $E_T^{\text{miss}}$
significance motivated the formation of a very particular combination of the kinematic variable
and its associated resolution into a single quantity, equivalent to the significance variable
$X_M$, which may contain all of the relevant discriminatory information. We would like to
show that it is not unusual for most of the relevant resolution information to be condensed
into a single simple $X$-like variable. Furthermore, we will show that it is even commonplace
under certain conditions, principally those in which the signal and backgrounds are
associated with different mass or energy scales.

Knowing that variables like $X_M$ frequently contain most of the relevant resolution
information is useful. It means that a user keen to see whether an analysis can benefit from
incorporating resolution information has a straightforward way of testing whether it might
help. For each event, using the description below, one can compute an $X_M$ significance
variable for the kinematic quantity of interest, and then try placing a cut on $X_M$ instead of
(or perhaps in addition to) the cut on the kinematic variable on which $X_M$ is based.

³ It is straightforward to show (see Appendix C) that the optimal way of making use of the information
in a cut-based analysis is always equivalent to a cut on the ratio of the likelihoods of the event under the
signal and background hypotheses, and MVA tools can often get close to such cuts.
If it is desired to include resolution information in an analysis, the work necessary to
compute that resolution on any particular kinematic variable is unavoidable, and specific
to the analysis in question. However, it is important to note (i) that this work is the
same regardless of whether the resolution is used in an MVA or in the construction of
an $X_M$-like significance variable, and (ii) that the construction of an $X_M$-like significance
variable is itself very simple, requiring only a subtraction, a division and the choice of a
signal-background separation scale $M$. Given that $X_M$-like variables are frequently close to
optimal (as we show below) there seems little reason to avoid adding them to our toolkits.

Finally, before moving on to specific examples, we note that $X_M$ itself will not always
be the optimal significance variable for an analysis. Any case in which resolutions are
significantly non-Gaussian may require, for optimality, the use of a significance variable
based on the likelihood ratio as described in Appendix C, or the use of an MVA tool to
approximate the likelihood ratio procedure. Nonetheless, our key message is that many
analyses could make use of resolution information at the event-by-event level which they
are presently throwing away, and that even if they do nothing else, analyses should consider
using this information. A simple way of using it that captures most of the information
thrown away is contained in an $X_M$-like significance variable, but where this is non-optimal,
the resolution information can and should still be used either with an MVA or a dedicated
derivation of the optimal significance variable(s) for the analysis in question.

4 Some worked examples of optimal significance variables in toy models 
4.1 The simplest case of all — Gaussian resolution 

Consider a search for a physics process using a single kinematic variable $m$. Using the
significance metric $s(c) = s/\sqrt{b}$, for $c$ a cut value, we can ask the question: how does
$\max_c s$ change if we also include some measure of the resolution on $m$? In other words,
what is the optimal combination of $m$ and $\sigma_m$ to maximize the significance metric $s$? To
begin, consider a simple model in which the variable $m$ has a delta function distribution,
$(1/N_i)\,dN_i/dm = \delta(m - M_i)$, where $i \in \{s, b\}$ (signal/background). For example, suppose
that $m = m_T$ in a classic search for a heavy gauge boson in the lepton+missing energy
channel. Due to the Jacobian peak, most of the probability for $m$ is near $M_i$, and so this
simple model may capture some aspects of the analysis. Let the resolution functions of $m$
be Gaussian with width $\sigma$. Then,

$$p_i(m, \sigma) = g(\sigma)\,\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(m - M_i)^2}{2\sigma^2}\right], \tag{4.1}$$



where $g(\sigma)$ is the distribution of $\sigma$. We assume that $g$ is not a delta function, otherwise the
resolution information does not tell us anything. For the reasons set out in Appendix C, the
optimal cut boundary on a combination of $m$ and $\sigma$ is a cut on the ratio $p_s(m, \sigma)/p_b(m, \sigma)$.
Dividing the probability density functions from above and monotonically transforming the answer
brings us to the conclusion that an appropriately chosen cut on the significance variable

$$V_{\text{opt}}^{\text{(Gaussian)}} = \frac{m - (M_s + M_b)/2}{\sigma^2} \tag{4.2}$$

cannot be beaten. We note that this significance variable is very similar to $X_M$ (with
$M = (M_s + M_b)/2$) and only differs in the use of the variance instead of the standard
deviation of the uncertainty in the denominator.
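The step from the likelihood ratio to the Gaussian optimal variable can be spelled out explicitly. The following is a sketch under the assumptions of this subsection only: delta-function truth distributions at $M_s$ and $M_b$, a common Gaussian width $\sigma$, and a distribution $g(\sigma)$ that is the same for signal and background so that it cancels in the ratio.

```latex
\begin{aligned}
\log\frac{p_s(m,\sigma)}{p_b(m,\sigma)}
  &= \frac{(m - M_b)^2 - (m - M_s)^2}{2\sigma^2} \\
  &= \frac{2m\,(M_s - M_b) - (M_s^2 - M_b^2)}{2\sigma^2} \\
  &= (M_s - M_b)\,\frac{m - (M_s + M_b)/2}{\sigma^2}.
\end{aligned}
```

Since $M_s - M_b$ is a fixed constant, a cut on the likelihood ratio is equivalent to a cut on $\left(m - (M_s + M_b)/2\right)/\sigma^2$.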

4.2 More realistic asymmetric resolutions 

We now consider a variant of the previous example. Up until now, we have studied only 
symmetric resolution smearing. However, due to falling prior kinematic spectra, more 
generally we might expect asymmetric resolution functions. Consider for example a Gumbel 
distribution for the resolution function: 

$$p_i(m) = \frac{1}{\beta}\exp\left(\frac{m - M_i}{\beta}\right)\exp\left[-\exp\left(\frac{m - M_i}{\beta}\right)\right]. \tag{4.3}$$

We choose this probability density function because with the identification $\sigma = \frac{e}{\sqrt{2\pi}}\beta$, to
second order in the Taylor expansion, the Gumbel and the Gaussian are the same. The
asymmetry in the Gumbel then is present at the third order. In the above parameterization,
the tail for the Gumbel is heavier on the left than the right, which represents the generic
case in which events are more likely to have smeared from lower values due to falling priors.
As we saw in the previous example, it does not matter what weighting function we
multiply $p_i$ by, so long as it does not depend on $i$, and this time we find that an appropriately
chosen cut on

$$V_{\text{opt}}^{\text{(Gumbel)}} = \exp\left(\frac{m - M_b}{\beta}\right) - \exp\left(\frac{m - M_s}{\beta}\right) + \frac{M_b - M_s}{\beta} \tag{4.4}$$

cannot be bettered for discrimination of signal from background in this model.
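The Gumbel case can be checked numerically. The sketch below assumes the Gumbel density of Eq. 4.3 with a heavier left tail and a common scale parameter $\beta$ for signal and background; the function names and test values are illustrative. It evaluates the optimal variable as the log of the ratio $p_s/p_b$, which is a monotonic transformation of the ratio and so defines the same cut boundaries.

```python
import math


def gumbel_pdf(m, scale, beta):
    """Gumbel density with a heavier left tail, peaked at m = scale."""
    z = (m - scale) / beta
    return math.exp(z - math.exp(z)) / beta


def v_opt_gumbel(m, beta, m_s, m_b):
    """log(p_s / p_b) for two Gumbel densities with a common beta."""
    return (math.exp((m - m_b) / beta)
            - math.exp((m - m_s) / beta)
            + (m_b - m_s) / beta)


# Compare the closed form with the directly computed log-ratio.
m, beta, m_s, m_b = 82.0, 3.0, 85.0, 80.0
direct = math.log(gumbel_pdf(m, m_s, beta) / gumbel_pdf(m, m_b, beta))
print(direct, v_opt_gumbel(m, beta, m_s, m_b))
```

For $m$ far below $M_b$ both exponentials vanish and the function tends to $(M_b - M_s)/\beta$, so in that regime only $\beta$ carries information.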

The lines of constant $p_s/p_b$ (equivalently the lines of constant $V_{\text{opt}}^{\text{(Gumbel)}}$) are richer than
for the Gaussian case. In the uninteresting case where $m \ll M_b$ (and thus also $m \ll M_s$,
as we will assume, without loss of generality, that there is a hierarchy of scales $M_s > M_b$),
the uncertainty parameter $\beta$ alone is the optimal cut variable (i.e. $m$ does not give
any information). Since one looks at counts which exceed bounds, we are interested more
in the kinematic maxima and thus in the cases $m \sim M_i$ and $m > M_i$. If $M_b < m < M_s$,
then the expression above reduces to the variable $X$ with $M = M_b$. Likewise if $m > M_s$
and $\beta$ is small compared to $M_s - M_b$. For $m > M_s$ and $\beta$ small compared to $M_s - M_b$, both
exponentials are large and we can reduce the expression to

$$\exp\left(\frac{m - \bar{M}}{\beta}\right)\sinh\left(\frac{M_s - M_b}{2\beta}\right) = \text{constant} \tag{4.5}$$

where $\bar{M}$ is the average of $M_s$ and $M_b$. The sinh term is relatively smaller and slowly
varying, and thus this is simply $X$ with $M = \bar{M}$. Figure 1 shows a plot of $p_s(m, \beta)/p_b(m, \beta)$
for $M_b = 80$ and $M_s = 85$. The level sets of Figure 1 correspond to the optimal combination
of $m$ and $\beta$. Straight lines indicate that $X$ is the optimal variable. One can clearly see that
for $m > M_b$, the level sets are straight lines and thus some form of $X$ is optimal.



Figure 1. Contours of constant $p_s(m, \beta)/p_b(m, \beta)$ (equivalently lines of constant $V_{\text{opt}}^{\text{(Gumbel)}}$) in the
$(m, \beta)$ plane for $M_b = 80$ and $M_s = 85$. We can see that for $m > M_b$ the contours are straight lines
and thus $X$ is the optimal variable.

4.3 Choosing the separation scale M 

The above constructions show that $M$ can play a dynamic role in the definition of $X$. The
interpretation of $M$ as the scale of Standard Model physics does not require that it be fixed
ahead of time, since detector resolutions can distort the reconstructed scale away from the
true scale. We can further quantify the dependence of $X$ on $M$ by studying the efficacy of
$X$ over $m$ with respect to $s/\sqrt{b}$.

Proposition 1. The maximum significance for $X_M$, taken over all values of $M$, can be no
worse than the maximum significance of $m$ itself.

Proof. Suppose that $k$ is a cut value on $m$ such that $s(k) = \max_c s$ for $m$. Then, let $M = k$;
since $\sigma_m > 0$, a cut at $X_M = 0$ selects exactly the events with $m > k$ and so will reproduce
the same significance $s(k)$. □

Corollary 1. There is no reason to be afraid of using $X_M$ instead of $m$ since (provided
the value of $M$ is chosen sensibly) an $X_M$-only analysis cannot be worse than an
$m$-only analysis.
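The argument behind Proposition 1 can be illustrated with a toy check. The sketch below uses randomly generated, purely hypothetical events (pairs of $m$ and a positive resolution $\sigma_m$) and confirms that the cut $X_M > 0$ with $M = k$ selects the same events as $m > k$, whatever the resolutions are.

```python
import random


def selected_by_m(events, k):
    """Indices of events passing the plain kinematic cut m > k."""
    return [i for i, (m, _sigma) in enumerate(events) if m > k]


def selected_by_x(events, scale_m, x_cut=0.0):
    """Indices of events passing X_M > x_cut, with X_M = (m - M) / sigma."""
    return [i for i, (m, sigma) in enumerate(events)
            if (m - scale_m) / sigma > x_cut]


rng = random.Random(1)
# Hypothetical events: (m, sigma_m) with sigma_m strictly positive.
events = [(rng.gauss(80.0, 10.0), rng.uniform(2.0, 15.0)) for _ in range(1000)]

k = 90.0
print(selected_by_m(events, k) == selected_by_x(events, scale_m=k))
```

Moving the cut on $X_M$ away from zero, or changing $M$, selects a different set of events; that freedom is what allows $X_M$ to do better than $m$, never worse.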

Now, consider a kinematic variable $m$ with zero resolution maximum $\hat{m}$. The value of
$M$ which maximizes $\max_c s_{X(M)}(c)$ need not be equal to $\hat{m}$. Obviously, if $\sigma$ is constant over
all events, $X$ induces the same ordering on events as $m$ and so any value of $M$ maximizes
$s$. Intuitively, it would seem that for varying resolutions, the optimal $M$ should be greater
than $\hat{m}$, but this need not be the case.

Proposition 2. Consider a kinematic variable $m$ with zero resolution maximum $\hat{m}$. The
optimal value of $M$ may be less than $\hat{m}$.

Proof. Consider the model in Eq. 4.1. We know that if the distribution of $\sigma(m)$ is also
a delta function, then $X$ and $m$ will give the same significance. Therefore, take a simple
extension:

$$g(\sigma) = p\,\delta(\sigma - \sigma_1) + (1 - p)\,\delta(\sigma - \sigma_2), \tag{4.6}$$

where $\sigma_i$ are two fixed values of $\sigma$ and $p \in [0,1]$. Note that we assume that $\sigma$ is independent
of $m$. With this simple model, we can easily compute the distributions of $m$, $X$ and $s$, as
seen in Figure 2 for $\hat{m} = 80$ for the background, $\hat{m} = 90$ for the signal, and $p = 1/2$. There, $\epsilon$ is
the signal efficiency, defined by $\epsilon(c) = \int_c^\infty dx\, f(x)$ for $f(x)$ the signal probability density
function and $c$ a cut value. In this setup, we can see that there is an $M < \hat{m}$ which
outperforms the significance at $M = \hat{m}$. This is seen clearly in the second plot of the figure
in which the low value of $M$ can allow for $X$ to distinguish between low and high resolution
events for the signal. In the limit as $\hat{m} - M > \sigma$, $X$ will be able to distinguish the low and
high resolution events, thus increasing $s$. For $\hat{m} - M \gg \sigma$, the efficacy of $X$ approaches
the constant resolution case and so one cannot gain more by decreasing $M$. □

Figure 2. These plots illustrate the distributions of $m$, $X$ and $s$ for a simple model in which $m$ is
always 'on shell' at 80 for the background and 90 for the signal. The resolutions can take one of
two values with probability 1/2, independent of the physics process.

For further properties of $X$ and related variables, including a discussion of computation,
see Appendix A.

5 Performance in fully simulated examples of physical interest

Using PYTHIA 8.170 [14-16], we simulate the distributions of $X_M$⁴ in canonical searches
that use the variables $m = m_T$ and $m = m_{T2}$.

⁴ We do not show $P_M$ because we are assuming Gaussian resolution functions and thus $X_M$ captures all
the information in $P_M$. Furthermore, as noted in Appendix A, $P_M$ is very expensive to compute in the tails
of the distributions, which are the most important regions for searches for new physics. The variables $Q_M$
and $Y_M$ (cf. Appendix A) introduce model dependence and are in general more involved to compute, and we
find in the cases we examined that there is no significant benefit over $X_M$.



5.1 W' (new gauge boson), transverse mass significance

The transverse mass was first used in the discovery of the $W$ boson and the measurement
of its mass at CERN by the UA1 collaboration [17]. Defined by Eq. 5.1, $m_T$ has the property
that $m_T \le m_W$. Since its first use, $m_T$ continues to be used for precise measurements of
the $W$ boson mass, as well as in searches for new physics. For example, $m_T$ is actively in
use to search for new gauge bosons like the $W'$ [18, 19]. We therefore use a $W'$ search with
$m_T$ as a model system to construct the transverse mass significance. We concentrate our
attention on the leptonic $W/W'$ decays so that the resolution function is determined almost
entirely by the resolution in the missing momentum vector. In this search, the $W$ mass
is a natural choice for $M$ in constructing $X_M$. In our Monte Carlo study, we simulate $pp$
collisions at $\sqrt{s} = 14$ TeV. The $W'$ boson is created with a mass⁵ of 100 GeV and the same
CKM matrix as the Standard Model $W$ boson. The resolution of the missing momentum
was modeled as $\sigma_{x,y} = 0.5\sqrt{\sum E_T}$, where $\sum E_T$ is the sum of all visible momentum and
follows the measured spectra in dijets [2]. The distributions of $m_T$, $X_M$ and $s$ are shown in
Fig. 3. The various rows of Fig. 3 demonstrate the effect of the $W$ width on the efficacy
of $X_M$. We can see that for a very narrow resonance background, $X_M$ is much better than
$m_T$, but as the width becomes large, the advantage decreases.

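$$m_T^2 = m_\nu^2 + m_{\text{lepton}}^2 + 2\left(\sqrt{m_\nu^2 + |\vec{p}_T^{\,\text{miss}}|^2}\,\sqrt{m_{\text{lepton}}^2 + (p_T^{\text{lepton}})^2} - \vec{p}_T^{\,\text{miss}} \cdot \vec{p}_T^{\,\text{lepton}}\right) \tag{5.1}$$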
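To make the construction of a transverse mass significance concrete, the sketch below computes $m_T$ for a single event and estimates its resolution by numerically smearing the missing momentum components. It is an illustration under stated assumptions, not the procedure used for the figures: the lepton and the invisible particle are taken to be massless, the lepton is treated as perfectly measured, each missing momentum component is smeared with a Gaussian of width $0.5\sqrt{\sum E_T}$, and the function names and default number of draws are hypothetical.

```python
import math
import random
import statistics


def transverse_mass(lep_px, lep_py, met_x, met_y):
    """m_T for a massless lepton and a massless invisible particle."""
    lep_pt = math.hypot(lep_px, lep_py)
    met = math.hypot(met_x, met_y)
    mt_squared = 2.0 * (lep_pt * met - (lep_px * met_x + lep_py * met_y))
    return math.sqrt(max(mt_squared, 0.0))


def mt_significance(lep_px, lep_py, met_x, met_y, sum_et, scale_m,
                    n_draws=2000, rng=None):
    """X_M for m_T, with sigma_m estimated by smearing the missing momentum."""
    rng = rng or random.Random(0)
    sigma_xy = 0.5 * math.sqrt(sum_et)  # assumed per-component resolution
    mt_observed = transverse_mass(lep_px, lep_py, met_x, met_y)
    smeared = [
        transverse_mass(lep_px, lep_py,
                        rng.gauss(met_x, sigma_xy), rng.gauss(met_y, sigma_xy))
        for _ in range(n_draws)
    ]
    sigma_mt = statistics.pstdev(smeared)
    if sigma_mt == 0.0:
        raise ValueError("resolution estimate is zero; X_M is undefined")
    return (mt_observed - scale_m) / sigma_mt


# Back-to-back lepton and missing momentum of 40 GeV each gives m_T = 80 GeV.
print(transverse_mass(40.0, 0.0, -40.0, 0.0))
print(mt_significance(40.0, 0.0, -40.0, 0.0, sum_et=100.0, scale_m=60.0))
```

Numerical smearing is used here instead of linear error propagation because it makes no Gaussian assumption about the resulting resolution function of $m_T$.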

5.2 $H \to \tau\tau$, transverse mass significance

Another possible use of the significance is in the standard $H \to \tau\tau$ search (measurement)
[21, 22]. In the dilepton channel, the dominant background is $Z$ boson production
and so the natural value for $M$ is 90 GeV. Figure 4 shows the distributions of $m_T$ (between
the total missing transverse momentum and the two-lepton composite system), $X_M$, and $s$
for a 125 GeV Higgs. The optimal value of $M$ was found to be less than 90 GeV, as indicated
in the figure. The $s$ plot shows that there can be a significant improvement from $X_M$
over $m_T$.

5.3 Pair production of light stops, $pp \to \tilde{t}\tilde{t}X$, stransverse mass significance

The transverse mass is very effective when there is one missing particle in an event topology,
such as a neutrino. However, with pair production of missing particles, additional considerations
are required. One natural generalization of $m_T$ is the variable $m_{T2}$ [20], defined by
Eq. 5.2 for a symmetric event topology involving one visible particle and one missing particle
in each branch. The missing particle in branch $x \in \{a, b\}$ has transverse momentum
$\vec{p}_{Tx}$ and $m_{Tx}$ is the transverse mass of one branch formed by the corresponding missing
particle momentum and the measured visible particle momentum. Further generalizations
of the $m_{T2}$ variable have been studied and applied to Tevatron and LHC data for mass
measurements and searches for new physics. For example, consider direct stop squark production
in R-parity conserving SUSY. There is a lot of interest now at the LHC in searches
for these signatures for light stop squarks with all the other sparticles very heavy. One
such search in ATLAS uses $m_{T2}$ in the dileptonic channel [23]. It is this model that we use
as our testing ground to construct the stransverse mass significance. With the leptons as
the visible particles in the definition of $m_{T2}$, this system once again has the feature that
the resolution is mostly due to the missing momentum vector. Since $t\bar{t}$ is the dominant
background, we take $M = 80$ GeV. Here, we only consider the decay $\tilde{t} \to t + \text{LSP}$. The
$m_{T2}$ distribution, stransverse mass significance, and $s$ are shown in Fig. 5 for a compressed
scenario of $m_{\text{stop}} = 350$ GeV and $m_{\text{LSP}} = 170$ GeV.

$$m_{T2} = \min_{\vec{p}_{Ta} + \vec{p}_{Tb} = \vec{p}_T^{\,\text{miss}}} \left\{\max\left(m_{Ta}, m_{Tb}\right)\right\} \tag{5.2}$$

⁵ Excluded by [18, 19]; useful here for illustration only.

Figure 3. In each row, the left plot compares the transverse mass distribution for a Standard
Model $W$ and a $W'$ with mass 100 GeV. The middle plot is the corresponding distributions of $X_M$
with $M = 80$ GeV. The right plot shows the rejection $s/\sqrt{b}$ as a function of the signal efficiency, in
arbitrary units. The bands show the statistical uncertainty due to limited Monte Carlo statistics.
The top row has a boson mass width of 0, the middle has a width of 20%, and the bottom row has
the full width. We can see that for this fixed value of $M$, the performance of $X_M$ is better than $m_T$
for a narrow width and then worse at higher width. By construction, $X_M$ cannot be worse than
$m_T$ and thus the optimal $M$ in the last row must be different from 80. The inset plot shows $X_M$
for $M = 100$, for which the performance of $X$ and $m_T$ is the same.

Figure 4. The left plot is the $m_T$ distribution for dileptonic $Z \to \tau\tau$ and $H \to \tau\tau$ for a 125 GeV
Higgs. The middle plot is the corresponding $X$ curve with $M = 60$ and the right plot is the rejection
versus efficiency relationship.

Figure 5. The left plot is the $m_{T2}$ distribution for dileptonic $t\bar{t}$ and $\tilde{t} \to t + \text{LSP}$ for a 350 GeV
stop and 170 GeV LSP. The middle plot is the corresponding $X$ curve with $M = 80$ and the right
plot is the rejection versus efficiency relationship.

6 Conclusions 

Given any bounded kinematic variable $m$, we have constructed the significance variable $X_M$
and its variants $Y_M$, $P_M$ and $Q_M$, which generalize the idea of missing transverse momentum
significance. We have proved that (for an appropriate choice of $M$) the significance variable
$X_M$, alone, cannot perform worse than the variable $m$ upon which it is based. We have found
concrete and physically interesting examples of the significance variable $X_M$ performing
better than $m_T$ or $m_{T2}$ as a discrimination variable. In particular, for $H \to \tau\tau$ we find that
$X_M$ can outperform $m_T$ with respect to $s/\sqrt{b}$ by ~20% and for direct stop production
$X_M$ is better than $m_{T2}$ by ~30%.

Even though we have seen improvements from $X_M$ in some standard applications of
bounded kinematic variables, the main purpose of this paper is to make a case that
event-by-event resolution information should be included in all analyses. The $X_M$-like significance
variables provide a simple algorithm that may capture most of the relevant discriminatory
information. When $X_M$ is not (nearly) optimal, the resolution information should be
integrated into analyses with an MVA or a dedicated derivation of the optimal significance
variable(s) for the analysis in question. We hope that significance variables will now become
part of the experimentalist's standard toolbox.
A Properties of $X_M$ and related variables

To begin, we need to show that the significance variable $X_M$ does indeed add new information
over a search using $m$ alone. This is not obvious, since it is often the case that the
resolution of $m$ is uncorrelated with the underlying physics process. In other words, the
distribution of $\sigma(m)$ is the same for both signal and background. Therefore, on its own
$\sigma(m)$ does not provide any useful information. To quantify the statement that $X_M$ adds
new information, we can show that if some event $i$ lands in the tail region of $m$, it need not
be in the tail region of $X_M$.



Proposition 3. For $N$ events, if $m$ induces an ordering on the events such that
$m_1 < m_2 < \cdots < m_N$, then it is not necessarily true that $X_M^{(1)} < X_M^{(2)} < \cdots < X_M^{(N)}$.



Proof. We can show this simply by demonstrating the $M$ dependence of $X_M$. It is easiest to
see when $N = 2$ and to view $X_M$ as a function of $M$. There are two possible configurations,
as illustrated in Figure 6. In $(X_M, M)$ space, $X_M$ is a linearly decreasing function of $M$.
The quantity which controls the ordering of $X_M$ is $\Delta = (m_2\sigma_1 - m_1\sigma_2)/(\sigma_1 - \sigma_2)$. When
$\Delta < 0$ or infinite in magnitude, then $X_M^{(2)} > X_M^{(1)}$ for all $M$. However, if $\Delta > 0$, then
there is a critical $M^*$ at which the ordering of $X_M^{(1)}$ and $X_M^{(2)}$ flips: the inequality that holds
for $M < M^*$ is reversed for $M > M^*$.
The value of $M^*$ is $\Delta$. For $N > 2$, the situation is more complicated, but the result is the
same; different values of $M$ can rearrange the distribution of events based on $X_M$ from the
distribution based on $m$. One can generalize the plots in Figure 6 for $N > 2$. Note that
the distribution of points of intersection with the $M$ axis forms the observed distribution
of $m$. □
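The two-event case in this proof is easy to reproduce numerically. In the sketch below the function names and the event values are illustrative assumptions; the crossing point is obtained by setting the two significances equal and solving for $M$.

```python
def x_sig(m, sigma, scale_m):
    """X_M = (m - M) / sigma for one event."""
    return (m - scale_m) / sigma


def critical_scale(m1, sigma1, m2, sigma2):
    """Scale M* at which two events have equal X_M.

    Returns None when the resolutions are equal, since the two lines in
    (X_M, M) space are then parallel and never cross.
    """
    if sigma1 == sigma2:
        return None
    return (m2 * sigma1 - m1 * sigma2) / (sigma1 - sigma2)


# Hypothetical events: event 1 has lower m but worse resolution than event 2.
m1, s1 = 85.0, 10.0
m2, s2 = 90.0, 5.0
m_star = critical_scale(m1, s1, m2, s2)
print(m_star)
print(x_sig(m1, s1, 80.0) < x_sig(m2, s2, 80.0))    # ordering below M*
print(x_sig(m1, s1, 100.0) < x_sig(m2, s2, 100.0))  # ordering above M*
```

For these inputs the ordering in $m$ is preserved for scales below $M^*$ and inverted above it, so the choice of $M$ changes which events populate the tail of $X_M$.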
Figure 6. The dependence of $X_M$ on $M$ for two events with $\Delta = (m_2\sigma_1 - m_1\sigma_2)/(\sigma_1 - \sigma_2) > 0$
in the left plot and $\Delta \in [-\infty, 0) \cup \{\infty\}$ in the right plot.

Now, we return to the original motivation for constructing a new variable from $m$. We
observed that in the absence of detector resolution, $m$ has a kinematic maximum $M$. If
we let $m^{\text{true}}$ denote the value of $m$ that we would observe given a delta function response
function from the detector, then this means that $\Pr(m^{\text{true}} > M) = 0$.
We are therefore motivated to try to compute the probability that $m^{\text{true}} > M$ for a given
event, since this is zero for the Standard Model background. However, since we do not know
the true value, the best we can do is compute

$$Q_M = \Pr\left(m^{\text{true}} > M \mid m^{\text{observed}}\right). \tag{A.1}$$

At first, it may seem like $Q_M$ and $P_M$ (from Eq. 2.1) contain the same information,
but in fact this is not the case.

Proposition 4. If $P_M$ induces an ordering on $N$ events given by $P_M^{(1)} < P_M^{(2)} < \cdots < P_M^{(N)}$,
then it is not necessarily the case that $Q_M^{(1)} < Q_M^{(2)} < \cdots < Q_M^{(N)}$.

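Proof. To see this, consider the case in which $N = 2$. Then, we can compute the difference
$Q_M^{(1)} - Q_M^{(2)}$ and relate it to $P_M^{(1)} - P_M^{(2)}$. Even in the case in which $R$ is a Gaussian, the
quantity

$$\begin{aligned}
Q_M^{(1)} - Q_M^{(2)} &= \int_M^\infty \left[p\left(m^{\text{true}} \mid m_1^{\text{observed}}\right) - p\left(m^{\text{true}} \mid m_2^{\text{observed}}\right)\right] dm^{\text{true}} \\
&= \frac{1}{p\left(m_1^{\text{observed}}\right)} \int_M^\infty \left[p\left(m_1^{\text{observed}} \mid m^{\text{true}}\right) - \frac{p\left(m_1^{\text{observed}}\right)}{p\left(m_2^{\text{observed}}\right)}\, p\left(m_2^{\text{observed}} \mid m^{\text{true}}\right)\right] p\left(m^{\text{true}}\right) dm^{\text{true}}
\end{aligned} \tag{A.2}$$

is not necessarily positive given that $P_M^{(1)} - P_M^{(2)}$ is positive. In this case, $X_M^{(1)} - X_M^{(2)}$
determines $\left[p(m_1^{\text{observed}} \mid m^{\text{true}}) - p(m_2^{\text{observed}} \mid m^{\text{true}})\right]$. However, because the ratio of probabilities
multiplying the second term in the second line of Eq. A.2 could be important and since $X_M$
has $M$ dependence, the integral does not just depend on the values of $p(m^{\text{observed}} \mid m^{\text{true}})$ at
the endpoints $\{M, \infty\}$ due to the weighting function $p(m^{\text{true}})$. □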
Just as we formed $X_M$ out of $P_M$ (Eq. 2.1), we could form a variable $Y_M$ from $Q_M$ of
the form

$$Y_M = \frac{m^{\text{observed}} - M}{\sigma_{R'}},$$

where $\sigma_{R'}$ is the width of $R' = p(m^{\text{true}} \mid m^{\text{observed}})$. In the case that $R'$ is a Gaussian, this completely
determines the behavior of $Q_M$ in the sense that both $Q_M$ and $Y_M$ induce the same ordering
of events. However, due to falling prior distributions, it is not often the case that $R'$ is
exactly Gaussian, though $Y_M$ is still useful because it is easier to compute than $Q_M$. Even
though both $Q_M$ and $Y_M$ aim to probe the truth structure of an event, one drawback is
that they both require knowledge of the prior $p(m^{\text{true}})$. We cannot get this distribution
from the observed data, and must instead rely on Monte Carlo simulations.



B Computation of $X_M$, $Y_M$, $P_M$ and $Q_M$



First, we consider the Gaussian variable $X_M$. Jet and lepton responses are parametrized
as a function of their coordinates in $(\eta, p_T)$ space. This response is defined to be the ratio
$p_T^{\text{observed}}/p_T^{\text{true}}$, so we have access to the variance of $p(p_T^{\text{observed}} \mid p_T^{\text{true}})$. For $X_M$, however,
we would like to know the width of $p(p_T^{\text{(re)measured}} \mid p_T^{\text{observed}})$. For ease of notation, let
$\rho = p_T^{\text{(re)measured}}$, $\mu = p_T^{\text{measured}}$ and $\tau = p_T^{\text{true}}$. Using the law of total probability and Bayes'
Law, we can expand $p(\rho \mid \mu)$ as in Eq. B.1.



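$$\begin{aligned}
p(\rho \mid \mu) &= \int p(\rho \mid \mu, \tau)\, p(\tau \mid \mu)\, d\tau \\
&= \int p(\rho \mid \tau)\, p(\tau \mid \mu)\, d\tau \\
&= \int p(\rho \mid \tau)\, \frac{p(\mu \mid \tau)\, p(\tau)}{p(\mu)}\, d\tau
\end{aligned} \tag{B.1}$$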



Now, suppose that we know the prior distribution $p(\tau)$ in terms of a histogram: $p(\tau) =
\sum_i \alpha_i \delta_i(\tau)$, where $i = 1, \ldots, N$ runs over the $N$ bins and $\delta_i$ is the indicator function on
bin $i$ over the range $[a_i, b_i]$. Then, in Eq. B.2, we insert this function into the results from
Eq. B.1. In Eq. B.2, $\text{Gauss}(x, \mu, \sigma)$ is a Gaussian with mean $\mu$ and standard deviation $\sigma$
evaluated at $x$.
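$$\begin{aligned}
p(\rho \mid \mu)\, p(\mu) &= \sum_i \alpha_i \int_{-\infty}^{\infty} p(\rho \mid \tau)\, p(\mu \mid \tau)\, \delta_i(\tau)\, d\tau \\
&= \sum_i \alpha_i \int_{a_i}^{b_i} p(\rho \mid \tau)\, p(\mu \mid \tau)\, d\tau \\
&= \sum_i \alpha_i \int_{a_i}^{b_i} \text{Gauss}(\rho, \tau, \sigma)\, \text{Gauss}(\mu, \tau, \sigma)\, d\tau \\
&= \sum_i \alpha_i\, \text{Gauss}(\rho, \mu, \sqrt{2}\sigma)\, \frac{1}{2}\left[\operatorname{erf}\left(\frac{2b_i - \rho - \mu}{2\sigma}\right) - \operatorname{erf}\left(\frac{2a_i - \rho - \mu}{2\sigma}\right)\right] \\
&= \text{Gauss}(\rho, \mu, \sqrt{2}\sigma) \sum_i \frac{\alpha_i}{2}\left[\operatorname{erf}\left(\frac{2b_i - \rho - \mu}{2\sigma}\right) - \operatorname{erf}\left(\frac{2a_i - \rho - \mu}{2\sigma}\right)\right] \\
&:= \text{Gauss}(\rho, \mu, \sqrt{2}\sigma) \times (*)
\end{aligned} \tag{B.2}$$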



Now, we want to understand how (*) in Eq. B.2 varies with $\rho$, since we view $p(\rho \mid \mu)$ as
fixed in $\mu$ and as a function of $\rho$. In Eq. B.3, we observe that the dependence of (*) in Eq. B.2
on $\rho$ goes to zero as $a_i \to b_i$ and thus, to a good approximation, $p(\rho \mid \mu) \propto \text{Gauss}(\rho, \mu, \sqrt{2}\sigma)$.
Practically then, to compute $X_M$, one must propagate these 'inflated' Gaussians into a
formula for the resolution function of $m$.

$$\frac{d(*)}{d\rho} \propto \text{Gauss}(2b_i, \rho + \mu, \sqrt{2}\sigma) - \text{Gauss}(2a_i, \rho + \mu, \sqrt{2}\sigma). \tag{B.3}$$
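The 'inflated' width $\sqrt{2}\sigma$ can be checked with a short simulation. The sketch below assumes a flat prior on the true value over a wide range and a Gaussian response of width $\sigma$ for both the measurement and the remeasurement; under these assumptions the difference between the two is independent of the true value, so its spread estimates the width of $p(\rho \mid \mu)$. The function name, range and sample size are illustrative choices.

```python
import math
import random
import statistics


def remeasurement_spread(sigma, n_events=20000, seed=7):
    """Spread of (remeasured - measured) for a flat prior on the true value."""
    rng = random.Random(seed)
    differences = []
    for _ in range(n_events):
        tau = rng.uniform(50.0, 150.0)  # true value, flat prior (assumption)
        mu = rng.gauss(tau, sigma)      # measured value
        rho = rng.gauss(tau, sigma)     # independent remeasurement
        differences.append(rho - mu)
    return statistics.pstdev(differences)


sigma = 2.0
print(remeasurement_spread(sigma), math.sqrt(2.0) * sigma)
```

With a steeply falling prior the distribution would no longer be exactly Gaussian, which is the situation where numerical propagation becomes necessary.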

If the Gaussian approximation for the resolution function is very good, then an analytic
approximation using linear error propagation would be sufficient. However, to capture
non-Gaussian attributes, numeric propagation may be necessary. In particular, if $m$ is
a mass-like variable with a restriction $m > 0$, the resolution function will necessarily be
non-Gaussian near $m = 0$. In such cases, we can estimate how many random draws are
necessary to accurately compute $\sigma_m$. If $s^2$ is the sample variance, then the variance of the
sample variance is given by Eq. B.4, where $\kappa$ is the excess kurtosis [13].



$$\text{Var}\left[s^2\right] = \sigma^4\left(\frac{2}{n - 1} + \frac{\kappa}{n}\right). \tag{B.4}$$

For an absolute uncertainty $f$ on the standard deviation and an $O(1)$ standard deviation,
one needs

$$n = \frac{2 + \kappa + f^2 + \sqrt{4 + 4\kappa + 4f^2 + \kappa^2 - 2f^2\kappa + f^4}}{2f^2}. \tag{B.5}$$



For $f \ll 1$ and an order 1 or smaller $\kappa$ (this is zero for a Gaussian),

$$n \approx \frac{2 + \kappa + \sqrt{4 + 4\kappa + \kappa^2}}{2f^2} \lesssim \frac{3}{f^2}. \tag{B.6}$$

For example, one needs $n \sim 300$ for an accuracy of 0.1 GeV. The computation for $Y_M$
is similar to that for $X_M$, except that instead of propagating the uncertainties from
$p(\tilde{p}|\hat{p})$, one must propagate the uncertainties from $p(\tau|\hat{p})$, the distribution of the true
value $\tau$ given the measurement $\hat{p}$, which requires the input of a prior distribution
$p(\tau)$. In general, these priors are not expected to be uniform, and thus the propagation
must be done numerically, as linear error propagation may not be accurate.
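
The sample-size estimates of Eqs. B.4 to B.6 can be evaluated directly. The Python sketch below is illustrative only (the function names are assumptions); it relies on the observation that Eq. B.5 is the larger root of the quadratic obtained by setting Eq. B.4 equal to $f^2$ with $\sigma = 1$.

```python
import math


def var_of_sample_variance(n, kappa, sigma=1.0):
    """Eq. B.4: variance of the sample variance for n draws with excess kurtosis kappa."""
    return sigma ** 4 * (2.0 / (n - 1) + kappa / n)


def draws_exact(f, kappa):
    """Eq. B.5: number of draws for which Eq. B.4 (with sigma = 1) equals f**2."""
    disc = 4 + 4 * kappa + 4 * f ** 2 + kappa ** 2 - 2 * f ** 2 * kappa + f ** 4
    return (2 + kappa + f ** 2 + math.sqrt(disc)) / (2 * f ** 2)


def draws_small_f(f, kappa):
    """Eq. B.6: the small-f limit of Eq. B.5."""
    return (2 + kappa + math.sqrt(4 + 4 * kappa + kappa ** 2)) / (2 * f ** 2)


# Illustrative inputs: f = 0.1 with kappa = 1 (non-Gaussian) and kappa = 0 (Gaussian).
n_exact = draws_exact(0.1, 1.0)
n_approx = draws_small_f(0.1, 1.0)
n_gaussian = draws_exact(0.1, 0.0)
```

With $f = 0.1$ and $\kappa = 1$ the approximate form gives $n = 300$, in line with the estimate quoted in the text; for a Gaussian ($\kappa = 0$) fewer draws are needed.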

The computation of $P_M$ and $Q_M$ may seem much harder than that of $X_M$ and $Y_M$.
However, this may not be the case. To ease the notation, we recycle letters from earlier by
letting $\tilde{p} = m_{\rm (re)measured}$ and $\hat{p} = m_{\rm measured}$. Then, we can rewrite $P_M$ as in Eq. B.7, where
$\Theta(x)$ is the Heaviside step function and the expectation value in the last line is taken over
the space with measure given by the conditional distribution $p(\tilde{p}|\hat{p})$.



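\[
\begin{aligned}
P_M &:= \Pr(\tilde{p} > M \,|\, \hat{p}) \\
&= \int \Pr(\tilde{p} > M \,|\, \tilde{p}, \hat{p})\, p(\tilde{p}\,|\,\hat{p})\, d\tilde{p} \\
&= \int \Pr(\tilde{p} > M \,|\, \tilde{p})\, p(\tilde{p}\,|\,\hat{p})\, d\tilde{p} \\
&= \int \Theta(\tilde{p} - M)\, p(\tilde{p}\,|\,\hat{p})\, d\tilde{p} \\
&= \langle \Theta(\tilde{p} - M) \rangle_{\tilde{p}|\hat{p}}.
\end{aligned}
\tag{B.7}
\]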

The reason for the different representation of $P_M$ in Eq. B.7 is that it gives rise to
an intuitive method for its computation and an easy way to assess its uncertainty. Since
$\Theta(x) \in \{0, 1\}$, we can think of the expectation above as the mean of a Bernoulli random
variable. If the real value of $P_M$ is $p$, then the variance of the sample mean is $p(1-p)/n$ and
thus the uncertainty is of order $\sqrt{p(1-p)/n}$. For an absolute uncertainty $f$ on the mean $p$,
one then needs



\[
n = \frac{p(1-p)}{f^2} \leq \frac{0.5(1-0.5)}{f^2} = \frac{1}{4f^2}. \tag{B.8}
\]
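
The Bernoulli picture behind Eqs. B.7 and B.8 translates into a very short Monte Carlo recipe. The following Python sketch is a minimal illustration under stated assumptions: the re-measured value is drawn from the 'inflated' Gaussian of width $\sqrt{2}\sigma$ centred on the measured value, and all names and numerical inputs are hypothetical.

```python
import math
import random


def estimate_p_m(p_meas, sigma, M, n, rng):
    """Sample mean of Theta(p_re - M) over n draws, with its Bernoulli standard error."""
    width = math.sqrt(2.0) * sigma  # assumed 'inflated' Gaussian resolution
    hits = sum(1 for _ in range(n) if rng.gauss(p_meas, width) > M)
    p = hits / n
    return p, math.sqrt(p * (1.0 - p) / n)


def draws_for_uncertainty(f):
    """Upper bound of Eq. B.8: 1 / (4 f^2), reached when p = 0.5."""
    return 0.25 / f ** 2


# Hypothetical event: measured value 100, resolution 10, endpoint M = 110.
rng = random.Random(12345)
p_hat, err = estimate_p_m(p_meas=100.0, sigma=10.0, M=110.0, n=2500, rng=rng)

# For this Gaussian toy model the probability is also known in closed form.
exact = 0.5 * math.erfc((110.0 - 100.0) / (2.0 * 10.0))
```

Because the standard error is at most $1/(2\sqrt{n})$, the number of draws can be fixed in advance from the target uncertainty, without knowing $P_M$.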

For example, one needs $n \sim 2500$ for an absolute uncertainty of 0.01. We make a similar
computation for $Q_M$ and note that $Q_M = \langle \Theta(m_{\rm true} - M) \rangle_{m_{\rm true}|m_{\rm observed}}$. The uncertainty
bound on $Q_M$ is thus similar to that on $P_M$, except that one must input truth distributions when
sampling. In order to meaningfully compare $X_M$ and $P_M$, one needs a way of relating
a given uncertainty on $X_M$ to an uncertainty on $P_M$. We can do this by quantifying the
interpretation of $X_M$ as a 'number of standard deviations beyond the endpoint', using
a map $\lambda : \mathbb{R} \to [0, 1]$ given by Eq. B.9. Given $\lambda$, we can ask how uncertainties in $X_M$
translate to uncertainties in $\lambda(X_M)$, which we can take as the level of precision
needed on $P_M$. Figure 7 shows the relationship between $\sigma_{X_M}$ and $\sigma_\lambda$ for several values of
$X_M$. We can see that if $X_M \sim 1$, then a 10% uncertainty in $X_M$ corresponds to an absolute
uncertainty of $\sim 0.05$ in $P_M$. However, if $X_M \sim 4$, then for a 0.1 absolute uncertainty on $X_M$
($\sim 3\%$) the required uncertainty on $P_M$ is $\sim 10^{-5}$. Since the absolute scaling of $X_M$
and $P_M$ is the same, this shows that it is very expensive to compute $P_M$. Even though $P_M$
can encode non-Gaussian features of resolution functions, this benefit may not justify the
computational cost relative to the computationally cheap $X_M$.

Figure 7. Using the map $\lambda$ between significance and probability, we can relate the absolute
uncertainty on $X_M$ to an absolute uncertainty on $\lambda(X_M)$, which is the precision we would need on
$P_M$ to make a meaningful comparison.

\[
\lambda(X_M) = \int_{-X_M}^{\infty} dx\, \mathrm{Gauss}(x, 0, 1). \tag{B.9}
\]
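
To see how quickly the precision required on $P_M$ tightens with increasing $X_M$, one can propagate an uncertainty on $X_M$ through the map to first order. The Python sketch below assumes that the map of Eq. B.9 is the integral of a unit Gaussian from $-X_M$ to infinity (an assumption about the integration limits); the function names are hypothetical.

```python
import math


def lam(x_m):
    """Assumed form of Eq. B.9: integral of Gauss(x, 0, 1) from -x_m to infinity."""
    return 0.5 * math.erfc(-x_m / math.sqrt(2.0))


def sigma_lam(x_m, sigma_x):
    """First-order propagation: |d(lam)/d(x_m)| * sigma_x = Gauss(x_m, 0, 1) * sigma_x."""
    return math.exp(-0.5 * x_m ** 2) / math.sqrt(2.0 * math.pi) * sigma_x


# Same absolute uncertainty of 0.1 on X_M at two different significances.
precision_at_1 = sigma_lam(1.0, 0.1)
precision_at_4 = sigma_lam(4.0, 0.1)
```

Under this assumption the required precision at $X_M \sim 4$ comes out of order $10^{-5}$, several orders of magnitude tighter than at $X_M \sim 1$, which is what makes a sampling-based $P_M$ expensive in the tails.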



C Optimum use of additional variables 

The conclusions of this appendix on the optimal use of variables are not new. However, it
may be useful to review what appears in the literature to be 'common knowledge.' Assume
an event is characterized by an observable $x$ and an uncertainty $\sigma$. In other words, once
an event is recorded, values for $x$ and $\sigma$ would be immediately known. Note that below,
$x$ and $\sigma$ are treated simply as variables with a joint distribution $p(x, \sigma)$, with no particular
use made of the concept of $\sigma$ as the uncertainty on the measurement of $x$,
though that interpretation is possible within the framework. Let $\mathbf{x} = \{x, \sigma\}$ and consider
an arbitrary function $f(\mathbf{x})$ which (in effect) defines a new variable. For example, $X = \frac{x - M}{\sigma}$
is such a function, this time containing a parameter $M$.

Consider two processes $s$ (signal) and $b$ (background) that we want to distinguish.
Signal events have a joint probability density function of the form $p_s(\mathbf{x})$, background events
$p_b(\mathbf{x})$, and the mixture of both has distribution $p(\mathbf{x}, \lambda) = \lambda p_s(\mathbf{x}) + (1 - \lambda) p_b(\mathbf{x})$, where
$\lambda \in [0, 1]$ is the fraction of signal events.

Given the processes $s$ and $b$, we can construct many functions $f$ and consider an analysis
$\mathcal{A}_f$ which takes $N_t$ total events and selects a subset $N < N_t$ for which $f > 0$. For each
analysis, we can construct a measure of performance by computing the expected value (with
respect to $p$) of some optimality metric $K(N_s, N_b)$, where $N_s + N_b = N$ and $N_s$ is the number
of true signal events among the $N$ selected by $\mathcal{A}_f$. For example, $K = N_s/\sqrt{N_b}$ is a standard
metric. An analysis $\mathcal{A}_f$ is optimal with respect to $K$ if no other choice of $f$ produces a
higher value of $K$. Optimal choices of $f$ are not unique: we can take an optimal analysis
$\mathcal{A}_f$ and transform $f$ by wrapping it within any function $g$ that maps positive values
to positive values and non-positive values to non-positive values, and this produces the same
analysis and thus the same $K$. The important parts of $f$ are therefore (i) its zeros (which
define the boundary between accepted and rejected events) and (ii) its sign as a function
of $\mathbf{x}$. We will see this fact (re)emerge from the mathematics later.
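
The statement that only the zeros and the sign of $f$ matter can be illustrated with a toy sample. In the Python sketch below, the discriminant, the sign-preserving wrapper (a hyperbolic tangent), the signal fraction and the event model are all hypothetical choices made for illustration.

```python
import math
import random


def f(x, sigma, M=100.0):
    """Hypothetical discriminant: the significance-like variable (x - M) / sigma."""
    return (x - M) / sigma


def wrapped(x, sigma):
    """Sign-preserving wrapper around f: tanh keeps the zeros and the sign of f."""
    return math.tanh(f(x, sigma))


def metric(selected):
    """Example metric K = N_s / sqrt(N_b) for a list of (is_signal, x, sigma) events."""
    n_s = sum(1 for is_sig, _, _ in selected if is_sig)
    n_b = len(selected) - n_s
    return n_s / math.sqrt(n_b) if n_b > 0 else float("inf")


# Toy mixture: a fraction `lam` of signal events, each with its own resolution sigma.
rng = random.Random(7)
lam = 0.2
events = []
for _ in range(5000):
    is_sig = rng.random() < lam
    sigma = rng.uniform(5.0, 20.0)
    truth = 130.0 if is_sig else 90.0
    events.append((is_sig, rng.gauss(truth, sigma), sigma))

selected_f = [e for e in events if f(e[1], e[2]) > 0]
selected_g = [e for e in events if wrapped(e[1], e[2]) > 0]
```

Both selections contain exactly the same events, so any metric computed from them is identical, even though the two discriminants take different numerical values.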

Hereafter take $f(\mathbf{x})$ to be an optimal choice of $f$ for some $K$, and create a (possibly
non-optimal) function $g(\mathbf{x}, \mu) = f(\mathbf{x}) + \mu h(\mathbf{x})$, where $h(\mathbf{x})$ is an arbitrary polluting function
of $\mathbf{x}$ and $\mu$ is a scalar parameter controlling the degree of non-optimality of $g$. Clearly $g$
becomes optimal when $\mu = 0$. Let



\[
D_i(\mu) = \int \Theta(g(\mathbf{x}, \mu))\, p_i(\mathbf{x})\, d\mathbf{x}, \tag{C.1}
\]



for $i \in \{s, b\}$, where $\Theta$ is the Heaviside step function. With this definition, the expected
numbers of signal and background events selected from $N_t$ events in total, in an analysis using the
possibly non-optimal discriminant $g(\mu)$, are given by $N_s = N_t \lambda D_s$ and $N_b = N_t (1 - \lambda) D_b$, and so if
$K$ were to take the explicit form $K_{\rm example} = N_s/\sqrt{N_b}$ then we would have

\[
K_{\rm example} = \frac{N_t \lambda D_s(\mu)}{\sqrt{N_t (1 - \lambda) D_b(\mu)}}.
\]

Since $g$ is optimal when $\mu = 0$, we know that $dK/d\mu = 0$ when evaluated at $\mu = 0$,
independent of the choice of $h(\mathbf{x})$. Accordingly, a necessary condition for optimality of $f$
(assuming that $N_t$ is non-zero and that $\lambda$ is neither zero nor one) is

\[
D_b(0)\, D_s'(0) - \frac{1}{2} D_s(0)\, D_b'(0) = 0
\]
in the case that $K = K_{\rm example}$, or for arbitrary $K$ would take the form
\[
\kappa_s D_s'(0) + \kappa_b D_b'(0) = 0 \tag{C.2}
\]
in which $\kappa_i = \left.\frac{\partial K}{\partial D_i}\right|_{\mu=0}$. Now we compute
\[
D_i'(\mu) = \int \delta(f(\mathbf{x}) + \mu h(\mathbf{x}))\, h(\mathbf{x})\, p_i(\mathbf{x})\, d\mathbf{x}, \tag{C.3}
\]

and note that we have freedom to choose any $h(\mathbf{x})$. We exercise that freedom by making
the choice $h(\mathbf{x}) = \delta^{(n)}(\mathbf{x} - \mathbf{m})$ for some arbitrary constant $\mathbf{m}$, where $n$ is the dimension
of our $\mathbf{x}$ space. With this particular choice of $h(\mathbf{x})$, Eq. C.2 becomes:

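\[
\kappa_s\, \delta(f(\mathbf{m}))\, p_s(\mathbf{m}) + \kappa_b\, \delta(f(\mathbf{m}))\, p_b(\mathbf{m}) = 0,
\]
or equivalently
\[
[\delta(f(\mathbf{m}))] \times [\kappa_s p_s(\mathbf{m}) + \kappa_b p_b(\mathbf{m})] = 0, \tag{C.4}
\]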

which must be true for any choice of $\mathbf{m}$. The presence of the two separate terms (multiplied
together) in Eq. C.4 reminds us of our earlier statements about which parts of $f$ should
matter. For one thing, it shows us that for all values of $\mathbf{m}$ which are off the boundary
defined by $f(\mathbf{m}) = 0$ the first term (containing the delta function) is zero, and so off of
this boundary there are no special constraints on $f$ deriving from $\kappa_s$, $\kappa_b$, $p_s$ and $p_b$. These
parameters are only relevant insofar as they affect the location of the optimal boundary
$f(\mathbf{x}) = 0$. We see that this optimal boundary is therefore controlled exclusively by the
second of the two terms in Eq. C.4 and its equality to zero. The boundary-determining
condition from the second term alone can be re-written as the requirement

\[
\frac{p_s(\mathbf{m})}{p_b(\mathbf{m})} = -\frac{\kappa_b}{\kappa_s}, \tag{C.5}
\]

which (we recall) must be satisfied by all values of $\mathbf{m}$ which lie on the optimal boundary
$f(\mathbf{m}) = 0$. In particular, the left-hand side of Eq. C.5 is a function of $\mathbf{m}$ whereas the
right-hand side is not! Accordingly, the values of $\mathbf{m}$ that occupy the boundary must be
exactly those for which

\[
\rho(\mathbf{m}) = \frac{p_s(\mathbf{m})}{p_b(\mathbf{m})}
\]

is a constant and equal to $-\kappa_b/\kappa_s$. Effectively, therefore, we now have all we need to know
to construct the optimal $f(\mathbf{x})$. All we need to do is the following:

1. Consider the 1-parameter family of curves in the $\{x, \sigma\}$-plane that satisfy
$\rho(\mathbf{x}) = p_s(\mathbf{x})/p_b(\mathbf{x}) = \mathrm{const} = \rho$, and consider them to be indexed by this real parameter $\rho$.

2. Treat each curve as defining a boundary between two regions of the plane, these
regions being named $R_\rho^+$ and $R_\rho^-$ respectively.

3. Let $R = \{R_\rho^+ \mid \rho \in \mathbb{R}\} \cup \{R_\rho^- \mid \rho \in \mathbb{R}\}$ be the set of all such regions.

4. For each region $r \in R$ calculate the fraction of signal events $F_s(r)$ expected to fall
within $r$:

\[
F_s(r) = \int_r p_s(\mathbf{x})\, d\mathbf{x}
\]

and calculate the same quantity for background events:

\[
F_b(r) = \int_r p_b(\mathbf{x})\, d\mathbf{x}.
\]

5. The optimal cut boundary $f(\mathbf{x}) = 0$ will be the boundary of the region $r \in R$ for
which $F_s(r)/F_b(r)$ equals the value of $\rho$ which defined that region $r$.
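
The boundary condition of Eq. C.5 can be checked numerically in a one-dimensional toy, where the level sets of the density ratio are single points and the candidate regions are half-lines $x > c$. The Python sketch below is an illustration under assumptions: signal and background are taken to be unit-width Gaussians with hypothetical means, and the metric is the example $K = N_s/\sqrt{N_b}$, for which the constant $-\kappa_b/\kappa_s$ works out to $D_s/(2D_b)$; other metrics give a different constant.

```python
import math

MU_S, MU_B = 2.0, 0.0  # hypothetical means of the unit-width signal and background


def tail(c, mu):
    """Fraction of a unit-width Gaussian centred at mu that lies above the cut c."""
    return 0.5 * math.erfc((c - mu) / math.sqrt(2.0))


def density_ratio(c):
    """rho(c) = p_s(c) / p_b(c) for the two unit-width Gaussians."""
    return math.exp(-0.5 * (c - MU_S) ** 2 + 0.5 * (c - MU_B) ** 2)


def metric(c):
    """K up to constant factors: D_s / sqrt(D_b) for the selection x > c."""
    return tail(c, MU_S) / math.sqrt(tail(c, MU_B))


# Brute-force scan of the cut position between 0 and 8 in steps of 0.001.
cuts = [k * 0.001 for k in range(0, 8001)]
best = max(cuts, key=metric)

lhs = density_ratio(best)                        # p_s / p_b on the boundary
rhs = 0.5 * tail(best, MU_S) / tail(best, MU_B)  # -kappa_b / kappa_s for this metric
```

At the cut that maximizes the metric, the density ratio on the boundary agrees with the metric-dependent constant to within the resolution of the scan, as Eq. C.5 requires.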
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